Multi-Action Dialog Policy Learning from Logged User Feedback

Abstract

Multi-action dialog policy (MADP), which generates multiple atomic dialog actions per turn, has been widely applied in task-oriented dialog systems to provide expressive and efficient system responses. Existing MADP models usually imitate action combinations from labeled multi-action dialog samples. Due to data limitations, they generalize poorly toward unseen dialog flows. While reinforcement learning-based methods have been proposed to incorporate service ratings from real users and user simulators as external supervision signals, they suffer from sparse and less credible dialog-level rewards. To cope with this problem, we explore improving multi-action dialog policy learning (MADPL) with explicit and implicit turn-level user feedback received for historical predictions (i.e., logged user feedback), which is cost-efficient to collect and faithful to real-world scenarios. The task is challenging since the logged user feedback provides only partial label feedback, limited to the particular historical dialog actions predicted by the agent. To fully exploit such feedback information, we propose BanditMatch, which addresses the task from a feedback-enhanced semi-supervised learning (SSL) perspective with a hybrid learning objective of SSL and bandit learning. BanditMatch integrates pseudo-labeling methods to better explore the action space through constructing full label feedback. Extensive experiments show that BanditMatch improves MADPL over state-of-the-art methods by generating more concise and informative responses. The source code and the appendix of this paper can be obtained from https://github.com/ShuoZhangXJTU/BanditMatch.
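The idea of combining partial label feedback with pseudo-labels can be illustrated with a small sketch. This is not the BanditMatch objective from the paper; it is a simplified illustration under stated assumptions: each atomic action has an independent predicted probability, user feedback is a 0/1 label available only for actions the agent actually predicted, and unlabeled actions receive a pseudo-label when the model is confident. The function names, the action names, and the `threshold` parameter are all hypothetical.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def complete_labels(probs, feedback, threshold=0.9):
    """Build a label dict from partial feedback plus confident pseudo-labels.

    probs:    dict mapping action name -> predicted probability
    feedback: dict mapping action name -> 1 (accepted) or 0 (rejected),
              present only for actions the logged policy predicted
    Actions with neither feedback nor a confident prediction are left out.
    """
    labels = {}
    for action, p in probs.items():
        if action in feedback:
            labels[action] = feedback[action]   # logged user feedback
        elif p >= threshold:
            labels[action] = 1                  # confident positive pseudo-label
        elif p <= 1.0 - threshold:
            labels[action] = 0                  # confident negative pseudo-label
    return labels


def hybrid_loss(probs, feedback, threshold=0.9):
    """Mean cross-entropy over actions with a feedback label or pseudo-label."""
    labels = complete_labels(probs, feedback, threshold)
    if not labels:
        return 0.0
    return sum(bce(probs[a], y) for a, y in labels.items()) / len(labels)


if __name__ == "__main__":
    # Hypothetical turn: the user rejected the logged "request" action.
    probs = {"inform": 0.95, "request": 0.6, "book": 0.05, "greet": 0.5}
    print(complete_labels(probs, {"request": 0}))
    print(hybrid_loss(probs, {"request": 0}))
```

In this sketch the uncertain action (`greet`) contributes nothing to the loss, which mirrors the general semi-supervised practice of training only on pseudo-labels that pass a confidence threshold. A real system would typically compute pseudo-labels and the loss on different views of the input rather than on the same prediction.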

Similar resources

Counterfactual Evaluation and Learning from Logged User Feedback

Batch learning from logged bandit feedback through counterfactual risk minimization

We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...

Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

Discovering Action-Dependent Relevance: Learning from Logged Data

In many learning problems, the decision maker is provided with various (types of) context information that she might utilize to select actions in order to maximize performance/rewards. But not all information is equally relevant: some context information may be more relevant to the decision problem at hand. Discovering and exploiting the most relevant context information speeds up learning, red...

Joint Optimization of User-desired Content in Multi-document Summaries by Learning from User Feedback

In this paper, we propose an extractive multi-document summarization (MDS) system using joint optimization and active learning for content selection grounded in user feedback. Our method interactively obtains user feedback to gradually improve the results of a state-of-the-art integer linear programming (ILP) framework for MDS. Our methods complement fully automatic methods in producing highqua...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i11.26636